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Abstract 


Optimum designs for parameter estimation in generalized regression mod¬ 
els are standardly based on the Fisher information matrix (cf. Atkinson et al. 

(2014) for a recent exposition). The corresponding optimality criteria are re¬ 
lated to the asymptotic properties of maximal likelihood (ML) estimators in 
such models. However, in finite sample experiments there could be problems 
with identifiability, stability and uniqueness of the ML estimate, which are 
not reflected by the information matrices. In Pazman and Pronzato (2014) is 
discussed how to solve some of these estimability issues on the design stage of 
an experiment in standard nonlinear regression. Here we want to extend this 
design methodology to more general models based on exponential families of 
distributions (binomial, Poisson, normal with parametrized variances, etc.). 

The main tool for that is the information (or Kullback-Leibler) divergence, 
which is closely related to the ML estimation. 

Keywords: Exponential families, stability of MLE, Kullback-Leibler divergence, 
optimality criteria. 

1 Introduction 

To each design point x G V, the design space, we associate an observation y (a ran¬ 
dom variable or vector), which is distributed according to the density of an expo¬ 
nential form 
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with the unknown parameter 9 taking values from a given parameter space 0 C M p . 
This density is taken with respect to a measure v (•) on Y, the sample space of y. 
Usually Y C and v is the Lebesgue measure, or Y is finite or countable and 
v ({?/}) = 1 for every y 6 Y; then f (y \ 9, x) is simply the probability of y. Well 
known examples are the one-dimensional normal density 


f(y\9,x) = 


exp 


[y - y (x, i 


(27r) 1 / 2 a (x, 6) 
or the binomial probability distribution 

f(y \9,x)= Q 7T (x, 9) y (l-vr (x, 9)) 


y e 


2cr 2 (x, 9) 

n ~ y - y G {0,1,..., 77,} . 


( 2 ) 


Consider an exact design X = (x±, ... , xn ), where x* G X and the observations 
y x y XN are independent. The ML estimator for 9 is 9 = argmaxgg© YliLi In / (y Xi \ 9, x*). 
For large N, and under some regularity assumptions, 9 is approximately distributed 
normally with mean 9 and variance il / _1 (X, 9), where M (X, 9) = YliLi M {x t . 9), 

and M (x, 9) = Eg c) ^qqqqt'^ ^ is the elemental information matrix at x (cf. Atkin¬ 
son et al. (2014) for this terminology). Hence within this asymptotic approximation, 
a design X is (locally) optimal if it maximizes $ [M (X, 0°)], where 9 0 is a guess for 
the true (but unknown) value of 9. Here $ (•) stands for det 1 ^ p (-) (or lndet(-)) in 
case of D-optimality, etc. This is the standard way to optimize designs in generalized 
regression models. 

Alternatively to the information matrix, we take here for design purposes the 
/-divergence (the information or Kullback-Leibler divergence, cf. Kullback (1997)), 
which for any two points 9°, 9 6 0 is equal to lx ( 9 °, 9) = J2iLi Ah (9°, 9) with the 
elemental /-divergence defined by 


I x (e°,9)=Ego 


\ f(y I o°,x) ' 

, n f (y\ 6, x) _ 


(3) 


As is well known, I x (9°,9) > 0 , and it is equal to zero if and only if / (y \ 9°, x) = 
f (y | 0,x). In general, the /-divergence measures well the sensitivity of the data y 
to the shift of the parameter from the value 9° to the value 6, even when 9 and 9° 
are distant, while the information matrix M (x, 9°) is doing essentially the same, 
but only for 9 which is close to 9° (see Sect. 3). Hence the /-divergence may allow a 
better characterization of the statistical properties of the model than the information 
matrix. An important fact is also that we can compute it easily (avoiding integrals) 
in models given by (1). Notice that for the normal model with cr 2 (x,9) = 1, one 
has 2 I x (9°, 9) = YliLi — y Ou, 9)] 2 , an expression, which is largely used 

in Pazman and Pronzato (2014). 

We note that in Lopez-Fidalgo et al. (2007) the /-divergence has been used for 
design purposes for model discrimination, which, however, is a different aim. 


2 Basic Properties of Model (1) 

It is clear that t(y) is a sufficient statistics in model (1), so we can suppose, at 
least in theory, that we observe t (y) instead of y. Denote by 77 (x, 9) its mean. For 
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7 = j(x, 0 ) we have 

V(x,9) = I t(y) exp {-i;{y) + t T (y)-f-K^)}^ {x d) du(y) = 


<Y 


8k ( 7 ) 
$7 


7=7(2;,0) 


(4) 

To be able to do this derivative at any 7 , we suppose that the set {7 : f Y exp{ (y)+ 
t T ( y ) 7 }dz/ (y) < 00 } is open. Then the model (1) is called regular, and regular mod¬ 
els are standard in applications. Taking the second order derivative in (4) we obtain 
q — Var^y [t (?/)], which for 7 = 7 (x, 9) will be denoted by £ (x, 9). By a reduc¬ 
tion of the linearly dependent components of the vector t ( y ) one can always achieve 
that £ (x, 9) is nonsingular, and we obtain from (4) that dr, QQ-f' = £ (x, 9) 4 ^^ . 

The functions 9 6 0 —> rj (x, 9) (the mean-value function) and 9 e 0 —y 7 (x, 9) 
(the canonical function) are useful dual representations of the family of densities ( 1 ) 


(cf. Efron (1978)). From ( 1 ) and (4) it follows that E Xj o 
consequently the elemental information matrix is equal to 

'din f(y | x, 0 ) 


din f(y\x,6) 
80 


= 0 , and 


M (x, 9) = Var X: g 


d9 


d V T (x, 9) E _i ^ ^ drj(x,9) 


89 


89 T 


(5) 


The elemental /-divergence is, according to (3) and (1), 

4 (0°, 0) = V T (x, 9°) [7 (x, 9°) - 7 (x, 0)] + K (7 (x, 9)) - K (7 (x, 9°)) 
For more details on exponential families see Brown (1986). 


3 Variability, Stability and /-Divergence 

In this section we consider observations according to an “exact” design X = (xi,.. ., xjv). 
The joint density of y = (y Xl ,..., y XN ) T is equal to /(y |0) = n£=i / (Vxi I Q, Xi). 

The variability of the ML estimate 9 in the neighborhood of 9, the true value 
of 0, is well expressed by the information matrix M (X, 0), since its inverse is the 
asymptotic variance matrix of 0. But the same can be achieved by the /-divergence, 
since we have in model ( 1 ) the following property of the /-divergence. 

Lemma 1. If for any x G X the third order derivatives of I x (0, 0) with respect to 
0 are bounded on a neighborhood of 0, then we have 

/ x (0,0)=i(0-0) T M(X,0)(0 — 0) +o (||0-0|| 2 ) . (6) 


It is sufficient to prove this equality for the elemental /-divergence and elemental 

_ _ QJ f Q q\ Q2 J /Q _ 

information matrix. We have I x (0,0) = 0, 7)^' - \e=e= 0, qqqqt \o=e= M (x, 0), 

so by the Tylor formula we obtain ( 6 ). 

On the other hand, we can have important instabilities of the ML estimate 0 
when, with a large probability, In / (y 10 ) is close to In / (y | 0 ) for a point 0 distant 
from 0. However, at the design stage we do not know the value of y, so we cannot 
predict the value of the difference In / (y 10) — In / (y |0). But we can predict at 
least its mean. 
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Lemma 2. For any 9 G 0 we have Eg jin / (y | 9) — In / (y |0) j = I x (9,9). 

This equality is evident from (3). 

As a consequence of Lemmas 1 and 2 we have that the /-divergence lx (S, 0 ) can 
express simultaneously both: the variability of the ML estimate 9 in a neighborhood 
of 9 and the danger of instability of the ML estimation due to the possibility of “false” 
estimates which are very distant from the true value 9 . 


4 Extended Optimality Criteria 


According to the principle mentioned in Sect. 3, the design based on the /-divergence 
should minimize the variability of 9 (related to the information matrix) and protect 
against instabilities coming from a value 9 which is distant from the true value 9 . 
This requirement can be well reflected by extended optimality criteria (cf. Pazman 
and Pronzato (2014) for the classical nonlinear regression). To any design (design 
measure or approximate design) / on a finite design space X we define the extended 
criteria in a form 




xex 


P 2 (90,9) 


+ K 




(7) 


where K > 0 is a tuning constant chosen in advance, 9 0 is a guess for the (unknown) 
value of 9, p (9°, 9) is a distance measure (a norm or a pseudonorm) not depending on 
the design /. When p (9°, 9) = H # 0 — #||, the Euclidean norm, we have the extended 
E-optimality criterion, denoted by (j) eE (/, 9°). When p (9°, 9) = \h (9°) — h (0)|, with 
h ( 9 ) e R, a given function of 9, we have the extended c-optimality criterion, denoted 
by fee (Z, 0°)- When p ( 9 °, 9) = max xe x |a (x, 9°) — a (x, 9) |, with x E X —m (x, 9) 6 
R, the regression function of interest, we have the extended G'-optimality criterion, 
denoted (j> e Q (/, 9°). Notice that usually a (x, 9) is equal to 7 (x, 9) or to p (x, 9), but 
not necessarily. The names of the criteria are justified by the following theorem. 

Theorem 1. Let B ( 9 °, S ) be a ball centred at 9 0 with the diameter S, and M ( 9 °, /) = 
T,xex M OMKOr)- Then 


lim min 21 x (9°, 9) 

<5->O0gB( 0°,<5) v ' 


1 

~pWJ) 


+ K 


Z(x) 


is equal to 

• A m in [M (0°,/)], the minimal eigenvalue, in case that p(9°,9) = \\9° — 9\\, 


[c T M~ (9°, £) c] 1 ifc = 


dh{0) 

dd 


\ e° 


is in the range of M (9°, /), or zero if it is 


not in this range, in case of p ( 9 °, 9) = \h (9°) — h 


[max x(E * f T (x) M 1 ( 9 °, /) / (a;)] 1 with f (x) = 
singular and if p ( 9 °, 9) = max Ig ^ \a (x, 9°) — a (x, 


da(x,6 ) 
89 


00 


if M (9°, /) is non- 
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When the model is normal, linear, with unit variances of observations, h ( 6 ) is linear 
function of 6 and a (x, 9) — rj (x, 6) is linear, then (j) ext (£, 9°) coincides with the 
corresponding well known optimality criterion in linear models. 

Proof. The proof follows from Lemma 1 and from known expressions: A m ; n [M] = 

/ J \2 

min u c 1 M~c = max„ : Mu ^ 0 ^ ^ (if c is in the range of M). In a normal 

linear model with unit variances we use that 7 ( x , 9) = 7 ( x , 9) = f T ( x ) 9 and 

£*=* 24 (0°, 0) e (s) = (^ - 0°) T [E. 6 at / 0*0 / T 0*0 £ 0*0] (f - 0°). □ 

If K is chosen very large, then the extended criterion gives the same optimum 
design as the classical criterion. On the other hand, when K is very small, then 
l/p 2 (9°,9) is the dominating term in (7), and we reject designs even with non- 
important instabilities at points 9 distant from 9°. 

Remark 1. We can write (7) in a form 

<Pext (f, 9°) = min V H (x, 9°, 9) f (x) , ( 8 ) 

' ' 0G0 z ' ' 7 

xex 

with an adequately chosen H (x, 9°, 9). It follows that the function £ —$■ cf ext (£, 9°) 
is concave, hence it has a directional derivative, and the “equivalence theorem” can 
be formulated, exactly as in Pazman and Pronzato (2014). 

Remark 2. I 11 the case that the design space X is finite, X = {x 1 ,..., x fc }, we can 
consider the task of computing the optimum design, f* = arg max^ f> ext (£, 9°) as an 
“infinite-dimensional” linear programming (LP) problem. Namely, we have to find 
the vector (t, f (x 1 ),..., £ (x fc )), which maximizes t under the linear constraints 

H (x, 9 °, 9 ) f (x) > t for every 9 6 0, 

x&X 

f (x) = 1, f (x) > 0 for every x G X . 

x€X 

Like in Pazman and Pronzato (2014), this can be approximated by iterative LP 
problems, including a stopping rule, which may be even more practical than the 
classical “equivalence theorem” (see the Numerical example below). 

Illustrative Example Consider the binomial model (2), which can be written in 
the exponential form ( 1 ): 

/ (y | 9, x) = exp |ln + 2/7 (x, 9) — n In (l + e 7 ^’ 6 ^) 

with 7 (x, 9) = In [ 7 r (x, 9) / (1 — 7 r (x, 6 1 ))], and with the mean of y — t(y) equal to 
77 (x, 9) = mr(x,9) = ne 7 ^’^/ (l + e 7 ^’ 6 ^) (the logistic function). In the example 
we took n = 10, and we considered the regression model (similar to that in Pazman 
and Pronzato (2014)) 

7 (x, 9) = 2 cos (t — u9 ); x — (t, u) T , 
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with two observations, one at x\ = (0 ,m) t and the second at X 2 = 


where u E [0, -^- 7 r] is to be chosen optimally for the estimation of the unknown 
parameter 9 E [0,1]. For the case of u — ^- 7 T we can see the circular “canoni¬ 
cal surface” j(y (xi, 9 ), 7 (x 2 , 9)) T ; 9 E [0,1] j in Fig. la and the “expectation sur¬ 
face” |(7 (xi, 9 ), 7 (x 2 , 0)) T ; 9 E [0,1]| in Fig. lb (which is no more circular due 

to the nonlinearity of the logistic function). The information “matrix” M u ( 9 ) = 
M (x\,9) + M (x 2 ,9) is computed according to (5), and for 9° = 0 it is equal to 
M u ( 9 °) = nu 2 . It follows that the classical locally optimal design maximizing 
M u (#°) is obtained when u = ^ 7 r. We see even from Fig. la and Fig. lb that 
under this design, the ML estimate 9 ( y ) can be, with a large probability, in the 
neighborhood of 6 — 1, hence the estimator is instable when 9° — 9 — 0. 

On the other hand, take the /-divergence I (9°, 9; u) = I Xl ( 9 °, 9) + I X2 (9°, 9) with 


I x (9°,9)=n 


n (x 


;, d 0 ) In 


7T (x, 9°) 


+ (l — 7T (x, 9 °)) In 


1 — n (x, 9°) 


(9) 


7 T (x,9) V V ' ' 1 — 7 T (x,9) 

and consider the extended criterion 4>ext(u,9° ) = min 0e[O) i ] I (9°,9;u) / (9 - 9°f. 
Numerical computation gives that argmax uG |- 0 4>ext (u, 9°) = 7 r, and for this 

choice of u the probability of a false 9 (y) is negligible, because then (7 (xi, 9) , 7 (x 2 , 9)) 
are for 9 = 0 and 9 = 1 as far as possible, the same holds for 7 (x, 9). We took here 
the tuning constant K = 0, otherwise, the optimal u would be between 7r and y 71 "- 
In Fig. lc we present the dependence of / ( 9 °, 9] u) / (9 — 9 0 ) 2 on 9 for different values 
of u. 


T 




Figure 1: (a) the canonical surface, (b) the expectation surface, (c) 

I{9°,9-,u)/{9-9 0 ) 2 


Numerical Example The aim of this example is to show that in the case that the 
design space X = {x 1 ,... ,x fc } is finite, we can compute the extended E-optimum 
design by using the LP—see Remark 2. The systematic part of the considered model 
is taken similar as in Pazman and Pronzato (2014), Example 2, i.e. the mean of y x , 
observed at the design point x = (xi,x 2 ) t is equal to 

77/ 

r] (x, 9) = 7 i 7 r (x, 9) — - (l + 0iXi + 9f(l- xi) + 9 2 x 2 + 9\ (1 — x 2 )) . (10) 


6 
















However, the error structure is quite different. Instead of an error component not 
depending on 9 = (9i,9 2 ) T , we have now a binomial model with y x distributed 
according to ( 2 ), with n — 10 and n (x, 6) from (10). Consequently, in the extended 
criterion (7) we use the binomial /-divergence (9) and p 2 (9°,9) = H# 0 — 9\\ 2 . We 
choose 9° = (1/8,1/8) T , x E X = {0, 0.1,..., 0.9, l} 2 , 9 E 0 = [-1,1] x [0, 2], 

The below used iterative algorithm follows the lines of Pazman and Pronzato 
(2014). 

0. Take any vector (£(°) (ad),..., (x fc )) such that Exe* ^ ( x ) = 1 an d 
£(°\x) > 0 V x E X, choose e > 0, set 0® = 0 and n — 0. Construct a 
finite grid in 0 . 

1. Set 0 (n+1) = 0hdu{6 , ( Tl+1 )}, where 9 (jl+1 '> = argmin@ e © H (u 0°> $) ^ n \ x ) 

is computed as follows: 

(a) Compute #( n +i) = argmin 0 e gc«) Exe* H (u ^°> $) ^ n \ x )• 

(b) Perform local minimization over 0 initialized at 9 (jlJrl \ denote by #( n+ 0 
the solution and set C?( n +i) = g( n ) y 

2. Use the LP solver to find (t^ n+1 \ £0+0 (x 1 ),... ,^ n +i) ( x k ^ so to maximize 
tS n+x > satisfying the constraints: 

• t( n+1 ) > 0, e (n+1) (^) > 0 Vx G A, E* e *f (n+1) 0&) = !, 

• Exe* H ( x , 9 °, 9) £<■ n+1 \x) > t^ n+1 ) \/9 E 0 (n+1 ). 

3. Set A( n +P = —(j) ext (£( n+1 ) ; 9 °), if A (n+1 ) < e, take £( n+1 ) as an e-optimal 
design and stop, or else n E- n + 1 and continue by step 1. 

The computations were performed in Matlab on a bi-processor PC (3.10 Ghz) 
equipped with 6 GB of RAM and with 64 bits Windows 8.1. LP problems were 
solved with interior point method. When the grid Q ^ was taken as a random latin 
hypercube design with 10000 points renormalized to 0 , e = 10 ” 10 , and ^ put mass 
uniformly to each x in X, the algorithm stopped for K = 0 after 14 iterations and 
for K = 10 6 after 20 iterations requiring 15 and 17 s. The obtained numerical results 
are summarized in Table 1. 


Table 1: The (p eE optimal designs for K = 0 and K = 10 6 ; the values of (p eE at £* 
according to (7) for K = 0 and K = 10 6 


K 

r 

0 eE (£*,0°) for K — 0 

4>eE (£*, 9°) for K = 10 6 

0 

f( 0 , 0 ) T ( 0 , 1 ) T ( 1 , 1) T 1 
(0.3464 0.0281 0.6255J 

0.0215 

0.0365 

10 6 

f ( 1 ) o) T ( 0 , 1) T 1 

(0.4921 0.5079 J 

2.17 x 10 ” 9 

0.6666 


For K = 10 6 we obtained almost /^-optimal design. On the other hand, under 
this “A-optimaF design, the value of the criterion (p eE with K = 0 is small, which 
indicates instabilities in the model. 
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